In the Credit data in the ISLR package it contains 400 customers and information on their credit history. For full information of the data look at the help file. A company has approached us to better understand factors that influence the Balance variable which is average credit card balance in USD. Using the information in the model discuss the influential factors, and discuss the factors you choose to put in the model. Do you have any concerns about the use of certain variables in the model? Discuss how your model was created and any insights you can provide based on the results. HINT: Adding Gender and/or Ethnicity could be controversial or illegal in some uses of this this model you should discuss your decision on these variables and how it effects the organizations ability to use your model for prediction or inference.
To get a better understanding of the features that influence the average credit card balance, we will explore various modeling methods as well as how they impact the variables we are working with. Since our primary focus in this section is inference, the need for any prediction accuracy metric is not needed for what we are doing. We will begin by doing our usual exploratory data analysis.
options(devise.ask.default = FALSE)
library(ISLR)
library(alr4)
library(lmtest)
library(statmod)
library(tweedie)
library(splines)
library(arm)
library(PRROC)
library(caret)
library(tidymodels)
library(yardstick)
library(dplyr)
library(rsample)
data(Credit)
Credit$Balance_plus <- Credit$Balance + 10
Credit$Balance_log <- log(Credit$Balance_plus)
Credit$Balance_sqrt <- sqrt(Credit$Balance)
Credit$Education_t <- cut(Credit$Education, breaks = c(5, 8, 20, 21), labels = c("low education", "mid education", "high education"), right = FALSE)
Credit$Education <- as.factor(Credit$Education_t)
Credit$Gender <- as.factor(Credit$Gender)
Credit$Student <- as.factor(Credit$Student)
Credit$Married <- as.factor(Credit$Married)
Credit$Ethnicity <- as.factor(Credit$Ethnicity)
head(Credit)
NA
summary(Credit)
ID Income Limit Rating Cards Age
Min. : 1.0 Min. : 10.35 Min. : 855 Min. : 93.0 Min. :1.000 Min. :23.00
1st Qu.:100.8 1st Qu.: 21.01 1st Qu.: 3088 1st Qu.:247.2 1st Qu.:2.000 1st Qu.:41.75
Median :200.5 Median : 33.12 Median : 4622 Median :344.0 Median :3.000 Median :56.00
Mean :200.5 Mean : 45.22 Mean : 4736 Mean :354.9 Mean :2.958 Mean :55.67
3rd Qu.:300.2 3rd Qu.: 57.47 3rd Qu.: 5873 3rd Qu.:437.2 3rd Qu.:4.000 3rd Qu.:70.00
Max. :400.0 Max. :186.63 Max. :13913 Max. :982.0 Max. :9.000 Max. :98.00
Education Gender Student Married Ethnicity Balance Balance_plus
low education : 14 Male :193 No :360 No :155 African American: 99 Min. : 0.00 Min. : 10.00
mid education :384 Female:207 Yes: 40 Yes:245 Asian :102 1st Qu.: 68.75 1st Qu.: 78.75
high education: 2 Caucasian :199 Median : 459.50 Median : 469.50
Mean : 520.01 Mean : 530.01
3rd Qu.: 863.00 3rd Qu.: 873.00
Max. :1999.00 Max. :2009.00
Balance_log Balance_sqrt Education_t
Min. :2.303 Min. : 0.000 low education : 14
1st Qu.:4.366 1st Qu.: 8.292 mid education :384
Median :6.152 Median :21.436 high education: 2
Mean :5.366 Mean :18.919
3rd Qu.:6.772 3rd Qu.:29.377
Max. :7.605 Max. :44.710
We see we are working with a dataset of about 400 observations and 12 variables. Immediately we can see that certain variables may be due for some transformation. Since our response is Balance, that seems to go over several orders of magnitude. It also contains zeros. We will probably have to do a log transformation with this one, but before we do, we had to make sure we cover for any 0’s by creating another column, “Balance_plus”, as well as “Balance_log”.
Now we can begin to actually fit our model. We will start by fitting almost all of the variables without the need of doing any type of transformation. Balance will be our response variable with everything else as our predictors.
plot(Credit)
creditm1 <- lm(Balance~Income+Limit+Rating+Cards+Age+Education+Gender+Student+Married+Ethnicity, data = Credit)
print(summary(creditm1))
Call:
lm(formula = Balance ~ Income + Limit + Rating + Cards + Age +
Education + Gender + Student + Married + Ethnicity, data = Credit)
Residuals:
Min 1Q Median 3Q Max
-160.61 -78.47 -13.68 54.22 320.01
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -496.73700 39.14605 -12.689 < 2e-16 ***
Income -7.80077 0.23441 -33.279 < 2e-16 ***
Limit 0.19002 0.03283 5.788 1.47e-08 ***
Rating 1.15040 0.49176 2.339 0.0198 *
Cards 17.62529 4.35179 4.050 6.19e-05 ***
Age -0.64508 0.29583 -2.181 0.0298 *
Educationmid education 5.07187 27.23343 0.186 0.8524
Educationhigh education -56.16997 75.57786 -0.743 0.4578
GenderFemale -10.49340 9.93500 -1.056 0.2915
StudentYes 424.14840 16.73953 25.338 < 2e-16 ***
MarriedYes -9.31295 10.37748 -0.897 0.3701
EthnicityAsian 16.93173 14.16508 1.195 0.2327
EthnicityCaucasian 9.88104 12.27861 0.805 0.4215
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 98.88 on 387 degrees of freedom
Multiple R-squared: 0.9551, Adjusted R-squared: 0.9537
F-statistic: 686.6 on 12 and 387 DF, p-value: < 2.2e-16
We now have a fit. We see a lot of insignificance, especially between levels of certain factors, however before we do any type of interpretation, it is best that we check that our assumptions have been met.
par(mfrow=c(2,2))
plot(creditm1)
residualPlots(creditm1)
Test stat Pr(>|Test stat|)
Income 2.4618 0.01426 *
Limit 11.6937 < 2e-16 ***
Rating 11.1810 < 2e-16 ***
Cards -0.8020 0.42305
Age 0.0106 0.99155
Education
Gender
Student
Married
Ethnicity
Tukey test 25.0684 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
resettest(creditm1, power = 2:3, type = "regressor")
RESET test
data: creditm1
RESET = 44.179, df1 = 10, df2 = 377, p-value < 2.2e-16
resettest(creditm1, power = 2:3, type = "fitted")
RESET test
data: creditm1
RESET = 1771.5, df1 = 2, df2 = 385, p-value < 2.2e-16
ncvTest(creditm1)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 121.3496, Df = 1, p = < 2.22e-16
bptest(creditm1)
studentized Breusch-Pagan test
data: creditm1
BP = 132.36, df = 12, p-value < 2.2e-16
vif(creditm1)
GVIF Df GVIF^(1/(2*Df))
Income 2.785338 1 1.668933
Limit 234.327459 1 15.307758
Rating 236.253753 1 15.370548
Cards 1.453276 1 1.205519
Age 1.062740 1 1.030893
Education 1.046743 2 1.011486
Gender 1.008308 1 1.004145
Student 1.031760 1 1.015756
Married 1.045710 1 1.022600
Ethnicity 1.043780 2 1.010770
outlierTest(creditm1)
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
infl <- data.frame(
hat = hatvalues(creditm1),
cooks = cooks.distance(creditm1),
rstandard = rstandard(creditm1),
rstudent = rstudent(creditm1)
)
infl <- infl[order(infl$cooks, decreasing = TRUE),]
print(head(infl))
Anova(creditm1, type = 'II')
Anova Table (Type II tests)
Response: Balance
Sum Sq Df F value Pr(>F)
Income 10827974 1 1107.4824 < 2.2e-16 ***
Limit 327559 1 33.5027 1.472e-08 ***
Rating 53506 1 5.4726 0.01982 *
Cards 160379 1 16.4035 6.187e-05 ***
Age 46488 1 4.7548 0.02982 *
Education 7605 2 0.3889 0.67806
Gender 10907 1 1.1156 0.29153
Student 6277106 1 642.0207 < 2.2e-16 ***
Married 7874 1 0.8054 0.37005
Ethnicity 14131 2 0.7227 0.48611
Residuals 3783741 387
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
avPlots(creditm1)
We see that we have pretty much violated almost all of our assumptions. We have heteroskedasticity, no linearity, multicollinearity, no normality of errors, etc. Before we try to fix this one model, we can see what transformations of the response does. We can try a square root transformation, a quadratic transformation, and a log transformation.
creditm2 <- lm(sqrt(Balance)~Income+Limit+Rating+Cards+Age+Education+Gender+Student+Married+Ethnicity, data = Credit)
print(summary(creditm2))
Call:
lm(formula = sqrt(Balance) ~ Income + Limit + Rating + Cards +
Age + Education + Gender + Student + Married + Ethnicity,
data = Credit)
Residuals:
Min 1Q Median 3Q Max
-9.6454 -1.1846 0.8389 1.9360 4.5213
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.2064847 1.0929837 -7.508 4.17e-13 ***
Income -0.2538834 0.0065448 -38.792 < 2e-16 ***
Limit 0.0064856 0.0009166 7.076 7.02e-12 ***
Rating 0.0186496 0.0137302 1.358 0.175161
Cards 0.4340006 0.1215050 3.572 0.000399 ***
Age -0.0166527 0.0082599 -2.016 0.044481 *
Educationmid education -0.5968170 0.7603755 -0.785 0.432994
Educationhigh education 1.2572061 2.1101841 0.596 0.551670
GenderFemale -0.1235043 0.2773918 -0.445 0.656399
StudentYes 10.7269177 0.4673788 22.951 < 2e-16 ***
MarriedYes 0.1777273 0.2897461 0.613 0.539979
EthnicityAsian 0.0269471 0.3954983 0.068 0.945714
EthnicityCaucasian 0.7210210 0.3428269 2.103 0.036097 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.761 on 387 degrees of freedom
Multiple R-squared: 0.9545, Adjusted R-squared: 0.9531
F-statistic: 676.6 on 12 and 387 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(creditm2)
residualPlots(creditm2)
Test stat Pr(>|Test stat|)
Income -2.4381 0.01522 *
Limit -7.9510 2.064e-14 ***
Rating -8.4508 5.986e-16 ***
Cards 0.7647 0.44493
Age -0.5000 0.61733
Education
Gender
Student
Married
Ethnicity
Tukey test -9.9094 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
resettest(creditm2, power = 2:3, type = "regressor")
RESET test
data: creditm2
RESET = 12.484, df1 = 10, df2 = 377, p-value < 2.2e-16
resettest(creditm2, power = 2:3, type = "fitted")
RESET test
data: creditm2
RESET = 130.22, df1 = 2, df2 = 385, p-value < 2.2e-16
ncvTest(creditm2)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 11.79597, Df = 1, p = 0.00059359
bptest(creditm2)
studentized Breusch-Pagan test
data: creditm2
BP = 27.643, df = 12, p-value = 0.006237
vif(creditm2)
GVIF Df GVIF^(1/(2*Df))
Income 2.785338 1 1.668933
Limit 234.327459 1 15.307758
Rating 236.253753 1 15.370548
Cards 1.453276 1 1.205519
Age 1.062740 1 1.030893
Education 1.046743 2 1.011486
Gender 1.008308 1 1.004145
Student 1.031760 1 1.015756
Married 1.045710 1 1.022600
Ethnicity 1.043780 2 1.010770
outlierTest(creditm2)
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
infl <- data.frame(
hat = hatvalues(creditm2),
cooks = cooks.distance(creditm2),
rstandard = rstandard(creditm2),
rstudent = rstudent(creditm2)
)
infl <- infl[order(infl$cooks, decreasing = TRUE),]
print(head(infl))
Anova(creditm2, type = 'II')
Anova Table (Type II tests)
Response: sqrt(Balance)
Sum Sq Df F value Pr(>F)
Income 11469.4 1 1504.8002 < 2.2e-16 ***
Limit 381.6 1 50.0646 7.024e-12 ***
Rating 14.1 1 1.8450 0.1751611
Cards 97.2 1 12.7583 0.0003991 ***
Age 31.0 1 4.0646 0.0444807 *
Education 11.2 2 0.7379 0.4787936
Gender 1.5 1 0.1982 0.6563994
Student 4014.9 1 526.7588 < 2.2e-16 ***
Married 2.9 1 0.3762 0.5399791
Ethnicity 49.5 2 3.2442 0.0400615 *
Residuals 2949.7 387
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
avPlots(creditm2)
Once again we see a similar story, everything was violated. The next one we will try is a quadratic transformation.
creditm3 <- lm(I(Balance^2)~Income+Limit+Rating+Cards+Age+Education+Gender+Student+Married+Ethnicity, data = Credit)
print(summary(creditm3))
Call:
lm(formula = I(Balance^2) ~ Income + Limit + Rating + Cards +
Age + Education + Gender + Student + Married + Ethnicity,
data = Credit)
Residuals:
Min 1Q Median 3Q Max
-603699 -199182 -73658 162696 1666101
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -841244.50 115277.87 -7.298 1.68e-12 ***
Income -6706.63 690.28 -9.716 < 2e-16 ***
Limit 156.17 96.68 1.615 0.1070
Rating 2278.02 1448.13 1.573 0.1165
Cards 29119.67 12815.23 2.272 0.0236 *
Age -1040.52 871.18 -1.194 0.2331
Educationmid education 22762.14 80197.41 0.284 0.7767
Educationhigh education -173675.25 222562.81 -0.780 0.4357
GenderFemale -18808.32 29256.74 -0.643 0.5207
StudentYes 598672.45 49294.82 12.145 < 2e-16 ***
MarriedYes -29306.40 30559.76 -0.959 0.3382
EthnicityAsian 26658.97 41713.53 0.639 0.5231
EthnicityCaucasian -21072.38 36158.23 -0.583 0.5604
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 291200 on 387 degrees of freedom
Multiple R-squared: 0.8001, Adjusted R-squared: 0.7939
F-statistic: 129.1 on 12 and 387 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(creditm3)
residualPlots(creditm3)
Test stat Pr(>|Test stat|)
Income 5.0895 5.62e-07 ***
Limit 22.8942 < 2.2e-16 ***
Rating 22.9439 < 2.2e-16 ***
Cards -0.8151 0.4155
Age 0.5449 0.5861
Education
Gender
Student
Married
Ethnicity
Tukey test 74.8441 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
resettest(creditm3, power = 2:3, type = "regressor")
RESET test
data: creditm3
RESET = 86.051, df1 = 10, df2 = 377, p-value < 2.2e-16
resettest(creditm3, power = 2:3, type = "fitted")
RESET test
data: creditm3
RESET = 2917, df1 = 2, df2 = 385, p-value < 2.2e-16
ncvTest(creditm3)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 85.14069, Df = 1, p = < 2.22e-16
bptest(creditm3)
studentized Breusch-Pagan test
data: creditm3
BP = 78.053, df = 12, p-value = 9.694e-12
vif(creditm3)
GVIF Df GVIF^(1/(2*Df))
Income 2.785338 1 1.668933
Limit 234.327459 1 15.307758
Rating 236.253753 1 15.370548
Cards 1.453276 1 1.205519
Age 1.062740 1 1.030893
Education 1.046743 2 1.011486
Gender 1.008308 1 1.004145
Student 1.031760 1 1.015756
Married 1.045710 1 1.022600
Ethnicity 1.043780 2 1.010770
outlierTest(creditm3)
infl <- data.frame(
hat = hatvalues(creditm3),
cooks = cooks.distance(creditm3),
rstudent = rstudent(creditm3),
rstandard = rstandard(creditm3)
)
infl <- infl[order(infl$cooks, decreasing = TRUE), ]
print(head(infl))
Anova(creditm3, type = 'II')
Anova Table (Type II tests)
Response: I(Balance^2)
Sum Sq Df F value Pr(>F)
Income 8.0035e+12 1 94.3962 < 2e-16 ***
Limit 2.2124e+11 1 2.6094 0.10705
Rating 2.0981e+11 1 2.4746 0.11652
Cards 4.3777e+11 1 5.1632 0.02362 *
Age 1.2095e+11 1 1.4266 0.23306
Education 8.1417e+10 2 0.4801 0.61907
Gender 3.5041e+10 1 0.4133 0.52069
Student 1.2506e+13 1 147.4945 < 2e-16 ***
Married 7.7974e+10 1 0.9197 0.33816
Ethnicity 1.5298e+11 2 0.9022 0.40654
Residuals 3.2812e+13 387
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
avPlots(creditm3)
Same story, lastly we will try the log transformation.
creditm4 <- lm(Balance_log~Income+Limit+Rating+Cards+Age+Education+Gender+Student+Married+Ethnicity, data = Credit)
print(summary(creditm4))
Call:
lm(formula = Balance_log ~ Income + Limit + Rating + Cards +
Age + Education + Gender + Student + Married + Ethnicity,
data = Credit)
Residuals:
Min 1Q Median 3Q Max
-2.3152 -0.4827 0.1120 0.6195 1.1458
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.8797980 0.2920928 6.436 3.64e-10 ***
Income -0.0371101 0.0017490 -21.217 < 2e-16 ***
Limit 0.0009766 0.0002450 3.987 8.01e-05 ***
Rating 0.0012247 0.0036693 0.334 0.7387
Cards 0.0544995 0.0324714 1.678 0.0941 .
Age -0.0022433 0.0022074 -1.016 0.3101
Educationmid education -0.1758029 0.2032054 -0.865 0.3875
Educationhigh education 0.5115178 0.5639331 0.907 0.3649
GenderFemale 0.0089602 0.0741312 0.121 0.9039
StudentYes 1.3224078 0.1249040 10.587 < 2e-16 ***
MarriedYes 0.0700029 0.0774328 0.904 0.3665
EthnicityAsian -0.0547513 0.1056944 -0.518 0.6047
EthnicityCaucasian 0.1397372 0.0916183 1.525 0.1280
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7378 on 387 degrees of freedom
Multiple R-squared: 0.84, Adjusted R-squared: 0.835
F-statistic: 169.3 on 12 and 387 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(creditm4)
residualPlots(creditm4)
Test stat Pr(>|Test stat|)
Income -2.9232 0.003669 **
Limit -12.1291 < 2.2e-16 ***
Rating -12.4418 < 2.2e-16 ***
Cards 1.1154 0.265361
Age -0.4757 0.634563
Education
Gender
Student
Married
Ethnicity
Tukey test -20.8820 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
resettest(creditm4, power = 2:3, type = 'regressor')
RESET test
data: creditm4
RESET = 23.921, df1 = 10, df2 = 377, p-value < 2.2e-16
resettest(creditm4, power = 2:3, type = 'fitted')
RESET test
data: creditm4
RESET = 235.02, df1 = 2, df2 = 385, p-value < 2.2e-16
ncvTest(creditm4)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 2.265877, Df = 1, p = 0.13225
bptest(creditm4)
studentized Breusch-Pagan test
data: creditm4
BP = 34.446, df = 12, p-value = 0.0005735
vif(creditm4)
GVIF Df GVIF^(1/(2*Df))
Income 2.785338 1 1.668933
Limit 234.327459 1 15.307758
Rating 236.253753 1 15.370548
Cards 1.453276 1 1.205519
Age 1.062740 1 1.030893
Education 1.046743 2 1.011486
Gender 1.008308 1 1.004145
Student 1.031760 1 1.015756
Married 1.045710 1 1.022600
Ethnicity 1.043780 2 1.010770
outlierTest(creditm4)
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
infl <- data.frame(
hat = hatvalues(creditm4),
cooks = cooks.distance(creditm4),
rstandard = rstandard(creditm4),
rstudent = rstudent(creditm4)
)
infl <- infl[order(infl$cooks, decreasing = TRUE),]
print(head(infl))
Based off of all of these, square root model actually performed the best with the constant variance being fixed, however there were still other violations that prevents us from choosing that model for any valid inferences. We can try a power-transformations to see what they suggest for us.
summary(powerTransform(cbind(Balance_plus, Income, Limit, Rating, Age, Cards) ~ 1, Credit))
bcPower Transformations to Multinormality
Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
Balance_plus 0.4824 0.50 0.4310 0.5338
Income 0.3941 0.50 0.2880 0.5003
Limit 0.7820 0.78 0.7150 0.8490
Rating 0.7206 0.72 0.6415 0.7996
Age 0.8003 1.00 0.4850 1.1155
Cards 0.3664 0.50 0.1998 0.5331
Likelihood ratio test that transformation parameters are equal to 0
(all log transformations)
Likelihood ratio test that no transformations are needed
We see from our multivariate transformation, that we would need to transform our numerical variables to something that would be hard for us to explain without a visual effects plot. We will go ahead and fit the model with the recommendations and check to see if our assumptions have been met.
creditpre <- lm(I(Balance_plus^0.48)~I(Income^0.38)+I(Limit^0.77)+I(Rating^0.69)+I(Cards^0.37)+I(Age^0.81)+Education+Gender+Student+Married+Ethnicity, data = Credit)
print(summary(creditpre))
Call:
lm(formula = I(Balance_plus^0.48) ~ I(Income^0.38) + I(Limit^0.77) +
I(Rating^0.69) + I(Cards^0.37) + I(Age^0.81) + Education +
Gender + Student + Married + Ethnicity, data = Credit)
Residuals:
Min 1Q Median 3Q Max
-7.690 -1.172 0.357 1.664 5.374
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.027024 1.344525 -0.764 0.44542
I(Income^0.38) -5.225395 0.160942 -32.468 < 2e-16 ***
I(Limit^0.77) 0.031879 0.006683 4.770 2.61e-06 ***
I(Rating^0.69) 0.296188 0.098226 3.015 0.00274 **
I(Cards^0.37) 1.093753 0.535981 2.041 0.04196 *
I(Age^0.81) -0.039091 0.018074 -2.163 0.03117 *
Educationmid education -0.066462 0.632721 -0.105 0.91640
Educationhigh education 1.529661 1.755909 0.871 0.38421
GenderFemale -0.083688 0.230803 -0.363 0.71711
StudentYes 8.257797 0.388824 21.238 < 2e-16 ***
MarriedYes -0.080067 0.241517 -0.332 0.74043
EthnicityAsian 0.144763 0.329301 0.440 0.66047
EthnicityCaucasian 0.690286 0.285358 2.419 0.01602 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.298 on 387 degrees of freedom
Multiple R-squared: 0.9494, Adjusted R-squared: 0.9478
F-statistic: 605.3 on 12 and 387 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(creditpre)
residualPlots(creditpre)
Test stat Pr(>|Test stat|)
I(Income^0.38) -10.6366 < 2.2e-16 ***
I(Limit^0.77) -4.6499 4.567e-06 ***
I(Rating^0.69) -5.0038 8.558e-07 ***
I(Cards^0.37) 0.8762 0.38149
I(Age^0.81) -0.7863 0.43220
Education
Gender
Student
Married
Ethnicity
Tukey test -2.3330 0.01965 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
resettest(creditpre, power = 2:3, type ='regressor')
RESET test
data: creditpre
RESET = 25.364, df1 = 10, df2 = 377, p-value < 2.2e-16
resettest(creditpre, power = 2:3, type = 'fitted')
RESET test
data: creditpre
RESET = 117.78, df1 = 2, df2 = 385, p-value < 2.2e-16
ncvTest(creditpre)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 26.32225, Df = 1, p = 2.8894e-07
bptest(creditpre)
studentized Breusch-Pagan test
data: creditpre
BP = 33.543, df = 12, p-value = 0.0007959
vif(creditpre)
GVIF Df GVIF^(1/(2*Df))
I(Income^0.38) 2.370845 1 1.539755
I(Limit^0.77) 212.785314 1 14.587163
I(Rating^0.69) 210.946329 1 14.523991
I(Cards^0.37) 1.453168 1 1.205474
I(Age^0.81) 1.062079 1 1.030572
Education 1.045337 2 1.011147
Gender 1.007287 1 1.003637
Student 1.030407 1 1.015090
Married 1.048421 1 1.023924
Ethnicity 1.045153 2 1.011102
outlierTest(creditpre)
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
infl <- data.frame(
hat = hatvalues(creditpre),
cooks = cooks.distance(creditpre),
rstandard = rstandard(creditpre),
rstudent = rstudent(creditpre)
)
infl <- infl[order(infl$cooks, decreasing = TRUE),]
print(head(infl))
Anova(creditpre, type = 'II')
Anova Table (Type II tests)
Response: I(Balance_plus^0.48)
Sum Sq Df F value Pr(>F)
I(Income^0.38) 5568.0 1 1054.1445 < 2.2e-16 ***
I(Limit^0.77) 120.2 1 22.7528 2.614e-06 ***
I(Rating^0.69) 48.0 1 9.0925 0.002736 **
I(Cards^0.37) 22.0 1 4.1643 0.041963 *
I(Age^0.81) 24.7 1 4.6776 0.031169 *
Education 5.0 2 0.4742 0.622767
Gender 0.7 1 0.1315 0.717106
Student 2382.4 1 451.0482 < 2.2e-16 ***
Married 0.6 1 0.1099 0.740433
Ethnicity 38.6 2 3.6511 0.026862 *
Residuals 2044.1 387
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
avPlots(creditpre)
The multivariate transformation was supposed to fix our assumption of normality, however we do see some of the tails that are a bit off, possibly due to outliers. Our assumption of homoscedasticity was fixed. We still see a violation of linearity as well as correlated variables in Rating and Limit. We can present this as is, or we can try to fit other higher ordered terms to see if we get an even better inference.
credittrue <- lm(Balance_log~bs(log(Income), df = 8)+bs(log(Limit), df = 8)+bs(Rating, df = 4)+bs(log(Income), df = 8)*Student+bs(log(Limit), df = 8)*Student+Cards+Age+Education+Gender+Student+Married+Ethnicity, data = Credit)
print(summary(credittrue))
Call:
lm(formula = Balance_log ~ bs(log(Income), df = 8) + bs(log(Limit),
df = 8) + bs(Rating, df = 4) + bs(log(Income), df = 8) *
Student + bs(log(Limit), df = 8) * Student + Cards + Age +
Education + Gender + Student + Married + Ethnicity, data = Credit)
Residuals:
Min 1Q Median 3Q Max
-2.23459 -0.17922 0.05146 0.25045 1.16902
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.991606 0.409313 4.866 1.72e-06 ***
bs(log(Income), df = 8)1 0.029857 0.346798 0.086 0.931440
bs(log(Income), df = 8)2 -0.212919 0.227787 -0.935 0.350566
bs(log(Income), df = 8)3 -0.348524 0.226753 -1.537 0.125183
bs(log(Income), df = 8)4 -0.777756 0.193150 -4.027 6.92e-05 ***
bs(log(Income), df = 8)5 -1.147392 0.221144 -5.188 3.58e-07 ***
bs(log(Income), df = 8)6 -2.375516 0.292470 -8.122 7.70e-15 ***
bs(log(Income), df = 8)7 -2.814527 0.405300 -6.944 1.83e-11 ***
bs(log(Income), df = 8)8 -3.002766 0.488433 -6.148 2.12e-09 ***
bs(log(Limit), df = 8)1 2.115477 0.469686 4.504 9.07e-06 ***
bs(log(Limit), df = 8)2 -1.954757 0.530163 -3.687 0.000262 ***
bs(log(Limit), df = 8)3 2.443621 0.661829 3.692 0.000257 ***
bs(log(Limit), df = 8)4 3.115935 0.744033 4.188 3.56e-05 ***
bs(log(Limit), df = 8)5 3.848833 0.841901 4.572 6.70e-06 ***
bs(log(Limit), df = 8)6 6.029782 1.135839 5.309 1.95e-07 ***
bs(log(Limit), df = 8)7 9.818801 1.853127 5.299 2.06e-07 ***
bs(log(Limit), df = 8)8 10.042406 3.256922 3.083 0.002207 **
bs(Rating, df = 4)1 0.465521 0.885357 0.526 0.599356
bs(Rating, df = 4)2 3.837305 1.475365 2.601 0.009687 **
bs(Rating, df = 4)3 -2.442739 2.228326 -1.096 0.273727
bs(Rating, df = 4)4 -1.256003 3.361197 -0.374 0.708868
StudentYes 0.916077 1.103797 0.830 0.407136
Cards 0.049483 0.020760 2.384 0.017671 *
Age -0.002164 0.001383 -1.565 0.118455
Educationmid education -0.124034 0.124660 -0.995 0.320425
Educationhigh education 0.035968 0.343191 0.105 0.916591
GenderFemale -0.054640 0.046869 -1.166 0.244481
MarriedYes 0.025525 0.048528 0.526 0.599233
EthnicityAsian 0.035349 0.066212 0.534 0.593763
EthnicityCaucasian 0.082205 0.057689 1.425 0.155045
bs(log(Income), df = 8)1:StudentYes -1.405774 1.890382 -0.744 0.457584
bs(log(Income), df = 8)2:StudentYes -0.290814 0.741387 -0.392 0.695105
bs(log(Income), df = 8)3:StudentYes -0.340203 0.913975 -0.372 0.709950
bs(log(Income), df = 8)4:StudentYes -0.095470 0.739834 -0.129 0.897397
bs(log(Income), df = 8)5:StudentYes 0.022294 0.818649 0.027 0.978290
bs(log(Income), df = 8)6:StudentYes 1.011247 1.352366 0.748 0.455100
bs(log(Income), df = 8)7:StudentYes 1.217211 1.037473 1.173 0.241486
bs(log(Income), df = 8)8:StudentYes 0.726799 1.179383 0.616 0.538123
bs(log(Limit), df = 8)1:StudentYes -1.553166 1.527164 -1.017 0.309836
bs(log(Limit), df = 8)2:StudentYes 5.509351 0.958799 5.746 1.97e-08 ***
bs(log(Limit), df = 8)3:StudentYes 0.566094 0.972210 0.582 0.560752
bs(log(Limit), df = 8)4:StudentYes 0.067911 0.848017 0.080 0.936217
bs(log(Limit), df = 8)5:StudentYes -0.198461 0.871075 -0.228 0.819907
bs(log(Limit), df = 8)6:StudentYes -1.456130 1.737644 -0.838 0.402601
bs(log(Limit), df = 8)7:StudentYes -1.802338 3.356128 -0.537 0.591585
bs(log(Limit), df = 8)8:StudentYes 0.374831 12.186300 0.031 0.975480
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.442 on 354 degrees of freedom
Multiple R-squared: 0.9475, Adjusted R-squared: 0.9408
F-statistic: 141.9 on 45 and 354 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(credittrue)
residualPlots(credittrue)
Test stat Pr(>|Test stat|)
bs(log(Income), df = 8)
bs(log(Limit), df = 8)
bs(Rating, df = 4)
Student
Cards 2.4633 0.01424 *
Age -1.0000 0.31800
Education
Gender
Married
Ethnicity
Tukey test -6.0918 1.116e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
resettest(credittrue, power = 2:3, type = 'regressor')
RESET test
data: credittrue
RESET = 1.6243, df1 = 44, df2 = 310, p-value = 0.01036
resettest(credittrue, power = 2:3, type = 'fitted')
RESET test
data: credittrue
RESET = 255.05, df1 = 2, df2 = 352, p-value < 2.2e-16
ncvTest(credittrue)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 66.85043, Df = 1, p = 2.9291e-16
bptest(credittrue)
studentized Breusch-Pagan test
data: credittrue
BP = 96.683, df = 45, p-value = 1.217e-05
vif(credittrue)
GVIF Df GVIF^(1/(2*Df))
bs(log(Income), df = 8) 6.453483e+01 8 1.297514
bs(log(Limit), df = 8) 4.583047e+07 8 3.011772
bs(Rating, df = 4) 1.823046e+07 4 8.083505
Student 2.244633e+02 1 14.982098
Cards 1.654751e+00 1 1.286371
Age 1.161355e+00 1 1.077662
Education 1.133720e+00 2 1.031874
Gender 1.122806e+00 1 1.059625
Married 1.144157e+00 1 1.069653
Ethnicity 1.254636e+00 2 1.058350
bs(log(Income), df = 8):Student 6.783345e+03 8 1.735662
bs(log(Limit), df = 8):Student 8.228509e+03 8 1.756740
outlierTest(credittrue)
infl <- data.frame(
hat = hatvalues(credittrue),
cooks = cooks.distance(credittrue),
rstandard = rstandard(credittrue),
rstudent = rstudent(credittrue)
)
infl <- infl[order(infl$cooks, decreasing = TRUE),]
print(head(infl))
Anova(credittrue, type = 'II')
Anova Table (Type II tests)
Response: Balance_log
Sum Sq Df F value Pr(>F)
bs(log(Income), df = 8) 87.523 8 55.9881 < 2e-16 ***
bs(log(Limit), df = 8) 28.130 8 17.9945 < 2e-16 ***
bs(Rating, df = 4) 1.796 4 2.2982 0.05862 .
Student 48.110 1 246.2072 < 2e-16 ***
Cards 1.110 1 5.6815 0.01767 *
Age 0.479 1 2.4495 0.11846
Education 0.238 2 0.6099 0.54398
Gender 0.266 1 1.3591 0.24448
Married 0.054 1 0.2767 0.59923
Ethnicity 0.425 2 1.0864 0.33857
bs(log(Income), df = 8):Student 2.118 8 1.3550 0.21523
bs(log(Limit), df = 8):Student 26.495 8 16.9485 < 2e-16 ***
Residuals 69.173 354
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
avPlots(credittrue)
NA
NA
NA
Now this model is more of a generalized additive model that uses b-splines. This time we are using the log version of the model as we noted earlier, that covers several magnitudes when we look at its range. This model fixes the linearity assumption, however it does violate the assumption of homoscedasticity unlike the previous model. We have a lot of multicollinearity, but that is expected when dealing with basis splines and other GAMs, we do know that Rating and Limit have consistently been co-linear even from our previous simpler models.
Looking at the ANOVA, almost every single variable looks to be significant with the exception of a few categorical ones, meaning we don’t really have to remove anything, we can still try to fit a sub-model without several of them and then use our ANOVA to test whether they give us the same information.
credittrue2 <- lm(Balance_log~bs(log(Income), df = 8)+bs(log(Limit), df = 8)+bs(log(Income), df = 8)*Student+bs(log(Limit), df = 8)*Student+Cards+Education+Gender+Student+Ethnicity, data = Credit)
print(summary(credittrue2))
Call:
lm(formula = Balance_log ~ bs(log(Income), df = 8) + bs(log(Limit),
df = 8) + bs(log(Income), df = 8) * Student + bs(log(Limit),
df = 8) * Student + Cards + Education + Gender + Student +
Ethnicity, data = Credit)
Residuals:
Min 1Q Median 3Q Max
-2.2217 -0.1802 0.0546 0.2555 1.2242
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.046323 0.340673 6.007 4.63e-09 ***
bs(log(Income), df = 8)1 0.009179 0.348119 0.026 0.9790
bs(log(Income), df = 8)2 -0.226223 0.227776 -0.993 0.3213
bs(log(Income), df = 8)3 -0.390428 0.227692 -1.715 0.0873 .
bs(log(Income), df = 8)4 -0.767501 0.194430 -3.947 9.50e-05 ***
bs(log(Income), df = 8)5 -1.243114 0.220103 -5.648 3.30e-08 ***
bs(log(Income), df = 8)6 -2.282454 0.292056 -7.815 6.10e-14 ***
bs(log(Income), df = 8)7 -2.928902 0.406139 -7.212 3.29e-12 ***
bs(log(Income), df = 8)8 -3.049308 0.489770 -6.226 1.33e-09 ***
bs(log(Limit), df = 8)1 2.086579 0.465935 4.478 1.01e-05 ***
bs(log(Limit), df = 8)2 -1.764082 0.318778 -5.534 6.04e-08 ***
bs(log(Limit), df = 8)3 3.570315 0.291061 12.267 < 2e-16 ***
bs(log(Limit), df = 8)4 4.568382 0.273767 16.687 < 2e-16 ***
bs(log(Limit), df = 8)5 5.331634 0.281899 18.913 < 2e-16 ***
bs(log(Limit), df = 8)6 7.100281 0.357066 19.885 < 2e-16 ***
bs(log(Limit), df = 8)7 8.211449 0.492983 16.657 < 2e-16 ***
bs(log(Limit), df = 8)8 8.502704 0.618839 13.740 < 2e-16 ***
StudentYes 0.761222 1.053280 0.723 0.4703
Cards 0.053701 0.017073 3.145 0.0018 **
Educationmid education -0.109216 0.124624 -0.876 0.3814
Educationhigh education 0.104317 0.342873 0.304 0.7611
GenderFemale -0.064443 0.046982 -1.372 0.1710
EthnicityAsian 0.043423 0.065292 0.665 0.5064
EthnicityCaucasian 0.087805 0.057673 1.522 0.1288
bs(log(Income), df = 8)1:StudentYes -1.171944 1.885248 -0.622 0.5346
bs(log(Income), df = 8)2:StudentYes -0.413607 0.736720 -0.561 0.5749
bs(log(Income), df = 8)3:StudentYes -0.222442 0.912999 -0.244 0.8076
bs(log(Income), df = 8)4:StudentYes -0.086307 0.741143 -0.116 0.9074
bs(log(Income), df = 8)5:StudentYes 0.169857 0.815009 0.208 0.8350
bs(log(Income), df = 8)6:StudentYes 1.023978 1.356611 0.755 0.4509
bs(log(Income), df = 8)7:StudentYes 1.107761 1.042088 1.063 0.2885
bs(log(Income), df = 8)8:StudentYes 0.840402 1.173741 0.716 0.4745
bs(log(Limit), df = 8)1:StudentYes -1.360759 1.467742 -0.927 0.3545
bs(log(Limit), df = 8)2:StudentYes 5.645960 0.938314 6.017 4.37e-09 ***
bs(log(Limit), df = 8)3:StudentYes 0.635661 0.924977 0.687 0.4924
bs(log(Limit), df = 8)4:StudentYes 0.227172 0.807745 0.281 0.7787
bs(log(Limit), df = 8)5:StudentYes -0.142554 0.823762 -0.173 0.8627
bs(log(Limit), df = 8)6:StudentYes -1.383242 1.708651 -0.810 0.4187
bs(log(Limit), df = 8)7:StudentYes -1.094757 3.309877 -0.331 0.7410
bs(log(Limit), df = 8)8:StudentYes -2.712221 12.077109 -0.225 0.8224
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4455 on 360 degrees of freedom
Multiple R-squared: 0.9457, Adjusted R-squared: 0.9399
F-statistic: 160.9 on 39 and 360 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(credittrue2)
residualPlots(credittrue2)
Test stat Pr(>|Test stat|)
bs(log(Income), df = 8)
bs(log(Limit), df = 8)
Student
Cards 2.5227 0.01208 *
Education
Gender
Ethnicity
Tukey test -5.7966 6.766e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
resettest(credittrue2, power = 2:3, type = 'regressor')
RESET test
data: credittrue2
RESET = 2.0149, df1 = 34, df2 = 326, p-value = 0.001024
resettest(credittrue2, power = 2:3, type = 'fitted')
RESET test
data: credittrue2
RESET = 267.6, df1 = 2, df2 = 358, p-value < 2.2e-16
ncvTest(credittrue2)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 69.70785, Df = 1, p = < 2.22e-16
bptest(credittrue2)
studentized Breusch-Pagan test
data: credittrue2
BP = 97.947, df = 39, p-value = 5.632e-07
vif(credittrue2)
GVIF Df GVIF^(1/(2*Df))
bs(log(Income), df = 8) 57.436113 8 1.288098
bs(log(Limit), df = 8) 50.998326 8 1.278563
Student 201.265526 1 14.186808
Cards 1.102060 1 1.049790
Education 1.098705 2 1.023812
Gender 1.111007 1 1.054043
Ethnicity 1.197650 2 1.046122
bs(log(Income), df = 8):Student 5912.637234 8 1.720823
bs(log(Limit), df = 8):Student 6010.998940 8 1.722599
outlierTest(credittrue2)
infl <- data.frame(
hat = hatvalues(credittrue2),
cooks = cooks.distance(credittrue2),
rstandard = rstandard(credittrue2),
rstudent = rstudent(credittrue2)
)
infl <- infl[order(infl$cooks, decreasing = TRUE),]
print(head(infl))
Anova(credittrue2, type = 'II')
Anova Table (Type II tests)
Response: Balance_log
Sum Sq Df F value Pr(>F)
bs(log(Income), df = 8) 89.50 8 56.3801 < 2.2e-16 ***
bs(log(Limit), df = 8) 964.09 8 607.3076 < 2.2e-16 ***
Student 48.68 1 245.3093 < 2.2e-16 ***
Cards 1.96 1 9.8936 0.001797 **
Education 0.24 2 0.5923 0.553616
Gender 0.37 1 1.8814 0.171028
Ethnicity 0.47 2 1.1955 0.303753
bs(log(Income), df = 8):Student 2.25 8 1.4198 0.186572
bs(log(Limit), df = 8):Student 27.02 8 17.0207 < 2.2e-16 ***
Residuals 71.44 360
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
avPlots(credittrue2)
NA
NA
NA
anova(credittrue2, credittrue)
Analysis of Variance Table
Model 1: Balance_log ~ bs(log(Income), df = 8) + bs(log(Limit), df = 8) +
bs(log(Income), df = 8) * Student + bs(log(Limit), df = 8) *
Student + Cards + Education + Gender + Student + Ethnicity
Model 2: Balance_log ~ bs(log(Income), df = 8) + bs(log(Limit), df = 8) +
bs(Rating, df = 4) + bs(log(Income), df = 8) * Student +
bs(log(Limit), df = 8) * Student + Cards + Age + Education +
Gender + Student + Married + Ethnicity
Res.Df RSS Df Sum of Sq F Pr(>F)
1 360 71.437
2 354 69.173 6 2.2636 1.9307 0.07508 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The ANOVA tells us that both models are the not same, so we will go with the more complex, but let us do some more diagnostics with the simpler ones anyway.
There is one last thing we must do, I want to look at the importance of several categorical variable and whether we can get away with a simpler model. To do that, we will do an ANOVA to compare the model we just made now, with a simpler model only using the cubic basis-splines for Income, Limit, Student and a combination of those with Student. All other variables we currently have, will be dropped on the predictor side.
credittrue3 <- lm(Balance_log~bs(log(Income), df = 8)+bs(log(Limit), df = 8)+bs(log(Income), df = 8)*Student+bs(log(Limit), df = 8)*Student+Student, data = Credit)
anova(credittrue3, credittrue2)
Analysis of Variance Table
Model 1: Balance_log ~ bs(log(Income), df = 8) + bs(log(Limit), df = 8) +
bs(log(Income), df = 8) * Student + bs(log(Limit), df = 8) *
Student + Student
Model 2: Balance_log ~ bs(log(Income), df = 8) + bs(log(Limit), df = 8) +
bs(log(Income), df = 8) * Student + bs(log(Limit), df = 8) *
Student + Cards + Education + Gender + Student + Ethnicity
Res.Df RSS Df Sum of Sq F Pr(>F)
1 366 74.538
2 360 71.437 6 3.1014 2.6049 0.01747 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The ANOVA tells us that we can NOT get away with an even simpler model using just the three variables we discussed earlier, BUT let’s run the diagnostics for this anyway!
par(mfrow=c(2,2))
plot(credittrue3)
residualPlots(credittrue3)
Test stat Pr(>|Test stat|)
bs(log(Income), df = 8)
bs(log(Limit), df = 8)
Student
Tukey test -4.3653 1.269e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
resettest(credittrue3, power = 2:3, type = 'regressor')
RESET test
data: credittrue3
RESET = 1.9439, df1 = 32, df2 = 334, p-value = 0.002193
resettest(credittrue3, power = 2:3, type = 'fitted')
RESET test
data: credittrue3
RESET = 224.28, df1 = 2, df2 = 364, p-value < 2.2e-16
ncvTest(credittrue3)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 76.5797, Df = 1, p = < 2.22e-16
bptest(credittrue3)
studentized Breusch-Pagan test
data: credittrue3
BP = 97.504, df = 33, p-value = 2.733e-08
vif(credittrue3)
GVIF Df GVIF^(1/(2*Df))
bs(log(Income), df = 8) 49.39544 8 1.276014
bs(log(Limit), df = 8) 44.58831 8 1.267875
Student 198.92415 1 14.104047
bs(log(Income), df = 8):Student 5491.03400 8 1.712886
bs(log(Limit), df = 8):Student 5398.22700 8 1.711062
outlierTest(credittrue3)
infl <- data.frame(
hat = hatvalues(credittrue3),
cooks = cooks.distance(credittrue3),
rstandard = rstandard(credittrue3),
rstudent = rstudent(credittrue3)
)
infl <- infl[order(infl$cooks, decreasing = TRUE), ]
print(head(infl))
It seems as if we are still reaching the threshold for linearity. Since we are using robust standard-errors we can ignore the violation for homoscedasticity. We have some co-linearity with the basis-splines which is expected and we do see some outliers. Now, our assumption for normality is extremely off, but since that is our weakest assumption, we will work with it for now. The most important thing right now is to see what our variables are telling us.
print(summary(credittrue3))
Call:
lm(formula = Balance_log ~ bs(log(Income), df = 8) + bs(log(Limit),
df = 8) + bs(log(Income), df = 8) * Student + bs(log(Limit),
df = 8) * Student + Student, data = Credit)
Residuals:
Min 1Q Median 3Q Max
-2.16630 -0.18665 0.04101 0.24140 1.21562
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.0730 0.3188 6.503 2.58e-10 ***
bs(log(Income), df = 8)1 0.1604 0.3480 0.461 0.645099
bs(log(Income), df = 8)2 -0.2990 0.2278 -1.313 0.190169
bs(log(Income), df = 8)3 -0.3742 0.2282 -1.639 0.101974
bs(log(Income), df = 8)4 -0.7554 0.1958 -3.857 0.000136 ***
bs(log(Income), df = 8)5 -1.2432 0.2212 -5.620 3.80e-08 ***
bs(log(Income), df = 8)6 -2.2806 0.2954 -7.720 1.12e-13 ***
bs(log(Income), df = 8)7 -2.8351 0.4099 -6.916 2.08e-11 ***
bs(log(Income), df = 8)8 -3.1713 0.4918 -6.449 3.57e-10 ***
bs(log(Limit), df = 8)1 2.2016 0.4662 4.723 3.32e-06 ***
bs(log(Limit), df = 8)2 -1.7853 0.3199 -5.581 4.66e-08 ***
bs(log(Limit), df = 8)3 3.6373 0.2923 12.444 < 2e-16 ***
bs(log(Limit), df = 8)4 4.5735 0.2744 16.668 < 2e-16 ***
bs(log(Limit), df = 8)5 5.4081 0.2820 19.177 < 2e-16 ***
bs(log(Limit), df = 8)6 7.0795 0.3544 19.976 < 2e-16 ***
bs(log(Limit), df = 8)7 8.2955 0.4986 16.636 < 2e-16 ***
bs(log(Limit), df = 8)8 8.6352 0.6253 13.810 < 2e-16 ***
StudentYes 0.6996 1.0608 0.659 0.509993
bs(log(Income), df = 8)1:StudentYes -1.0899 1.9044 -0.572 0.567459
bs(log(Income), df = 8)2:StudentYes -0.3609 0.7457 -0.484 0.628682
bs(log(Income), df = 8)3:StudentYes -0.1328 0.9234 -0.144 0.885687
bs(log(Income), df = 8)4:StudentYes -0.1444 0.7419 -0.195 0.845798
bs(log(Income), df = 8)5:StudentYes 0.3374 0.8227 0.410 0.681985
bs(log(Income), df = 8)6:StudentYes 0.7923 1.3680 0.579 0.562834
bs(log(Income), df = 8)7:StudentYes 1.1386 1.0543 1.080 0.280879
bs(log(Income), df = 8)8:StudentYes 1.0165 1.1853 0.858 0.391669
bs(log(Limit), df = 8)1:StudentYes -1.5104 1.4748 -1.024 0.306469
bs(log(Limit), df = 8)2:StudentYes 5.7885 0.9480 6.106 2.61e-09 ***
bs(log(Limit), df = 8)3:StudentYes 0.5670 0.9328 0.608 0.543657
bs(log(Limit), df = 8)4:StudentYes 0.3660 0.8089 0.452 0.651227
bs(log(Limit), df = 8)5:StudentYes -0.3099 0.8294 -0.374 0.708904
bs(log(Limit), df = 8)6:StudentYes -0.7770 1.7036 -0.456 0.648608
bs(log(Limit), df = 8)7:StudentYes -2.2900 3.3167 -0.690 0.490337
bs(log(Limit), df = 8)8:StudentYes 0.4300 12.0706 0.036 0.971599
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4513 on 366 degrees of freedom
Multiple R-squared: 0.9434, Adjusted R-squared: 0.9383
F-statistic: 184.8 on 33 and 366 DF, p-value: < 2.2e-16
Anova(credittrue3, type = 'III')
Anova Table (Type III tests)
Response: Balance_log
Sum Sq Df F value Pr(>F)
(Intercept) 8.61 1 42.2899 2.584e-10 ***
bs(log(Income), df = 8) 91.45 8 56.1313 < 2.2e-16 ***
bs(log(Limit), df = 8) 962.56 8 590.7961 < 2.2e-16 ***
Student 0.09 1 0.4349 0.5100
bs(log(Income), df = 8):Student 2.23 8 1.3683 0.2089
bs(log(Limit), df = 8):Student 27.67 8 16.9812 < 2.2e-16 ***
Residuals 74.54 366
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
avPlots(credittrue3)
NA
NA
Due to the complexity of our GAMs and Cubic Basis-Splines, the best way to show the affects of this model is to utilize an effects plot for each of the complex spline variables, holding all other variables constant.
For the splines as discussed earlier, we will need to showcase an effects plot to show how the variables with splines behave with different inputs. The lone variable, Student, is more interpretable through our summary table, however, we can see that it has a p-value of .938. While the main effect of Student status is not significant, the ANOVA indicates that Student has a statistically significant impact on the response through its nonlinear interactions with Income and Limit. This means that Student status does not uniformly shift Balance levels, but it does alter the shape of the relationship between Income, Limit, and Balance.
plot(Effect('Income', credittrue))
plot(Effect('Limit', credittrue))
We see that with everything held constantly, as Income increases, the log of Balance seems to decrease which possibly tells us that the more money that one makes, the less need to utilize a lending product such as a credit card. This is, of course, being a non-student. We also see in the second effects plot that, while being a non-student, as the limit increases, the log of balance seems to increase. This can possibly tell us that, the more one has, the more they tend to spend without being too conscious of the fact that they have to pay back what they borrow. Of course, we can not come to that exact conclusion as to why the log of balance increase in this situation.
This is the best we have within the linear regression setting, but going back to our histogram from the very beginning, we notated a heavy right skew with a heavy concentration of values at 0 for the Balance distribution. That indicated that a GAMMA or Tweetie model may have been the more appropriate method, so we will build that one out before looking at prediction capabilities of our models. The Tweetie model has more relaxed assumptions and is far more accurate for situations like this.
powers <- seq(1.1, 1.8, by = 0.1)
#Credit$Limit_log <- log(Credit$Limit + 1)
p_prof <- tweedie.profile(Balance~Income*log(Limit)+bs(Rating, df = 4)+Cards+Age+Education+Gender+Student+Married+Ethnicity, data = Credit, p.vec = powers, control = glm.control(maxit = 10000))
1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8
........Done.
best_p <- p_prof$p.max
print(paste('best P:', best_p))
[1] "best P: 1.1"
credittweed <- glm(Balance~Income*log(Limit)+bs(Rating, df = 4)+Cards+Age+Education+Gender+Student+Married+Ethnicity, data = Credit, family = tweedie(var.power = best_p, link.power = 0))
summary(credittweed)
Call:
glm(formula = Balance ~ Income * log(Limit) + bs(Rating, df = 4) +
Cards + Age + Education + Gender + Student + Married + Ethnicity,
family = tweedie(var.power = best_p, link.power = 0), data = Credit)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.556e+01 3.826e+00 -4.066 5.80e-05 ***
Income -2.109e-01 2.455e-02 -8.594 < 2e-16 ***
log(Limit) 2.162e+00 5.383e-01 4.016 7.13e-05 ***
bs(Rating, df = 4)1 3.798e+00 8.777e-01 4.327 1.93e-05 ***
bs(Rating, df = 4)2 4.699e+00 1.286e+00 3.654 0.000294 ***
bs(Rating, df = 4)3 3.278e+00 1.471e+00 2.229 0.026389 *
bs(Rating, df = 4)4 2.168e+00 1.609e+00 1.348 0.178609
Cards 3.571e-02 1.281e-02 2.788 0.005561 **
Age -1.055e-03 9.453e-04 -1.116 0.264953
Educationmid education -1.988e-02 8.274e-02 -0.240 0.810249
Educationhigh education 3.504e-02 2.552e-01 0.137 0.890837
GenderFemale 6.291e-04 3.155e-02 0.020 0.984100
StudentYes 7.058e-01 4.438e-02 15.902 < 2e-16 ***
MarriedYes 4.762e-03 3.359e-02 0.142 0.887331
EthnicityAsian -1.483e-02 4.474e-02 -0.331 0.740522
EthnicityCaucasian 3.480e-02 3.880e-02 0.897 0.370295
Income:log(Limit) 2.228e-02 2.793e-03 7.975 1.77e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Tweedie family taken to be 25.86649)
Null deviance: 107208.3 on 399 degrees of freedom
Residual deviance: 8220.3 on 383 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 7
Anova(credittweed, test.statistic = 'Wald')
Analysis of Deviance Table (Type II tests)
Response: Balance
Df Chisq Pr(>Chisq)
Income 1 325.1954 < 2.2e-16 ***
log(Limit) 1 28.4638 9.546e-08 ***
bs(Rating, df = 4) 4 73.9545 3.315e-15 ***
Cards 1 7.7752 0.005297 **
Age 1 1.2463 0.264253
Education 2 0.1067 0.948034
Gender 1 0.0004 0.984090
Student 1 252.8854 < 2.2e-16 ***
Married 1 0.0201 0.887257
Ethnicity 2 1.8402 0.398480
Income:log(Limit) 1 63.6051 1.520e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We see our initial model and we also see that several of our variables are insignificant in this setting. We see an expected increase of about 7.2% in the balance when we go from 1 card to 2 cards, and an increase of about 7.7% when we go from 1 card to 3 cards. In fact we see an expected increase in balance when we go from one card to any amount of card. Keep in mind, these levels are only significant when we have 6 and 7 cards, but anything more or less than that reads as insignificant. We see several of our factors as insignificant like being married, ethnicity, education and gender. With income, each 1% increase in Income results in an expected decrease in balance of about 20%. Holding all things constant, with the log of limit we see an increase of 2.1% for every 1% increase in log of limit. The interaction between income and log of limit shows us that the effect of credit limit on balance depends on income and vice versa; as income increases, the percentage increase in expected balance associated with a 1% increase in credit limit becomes larger. In other words, the positive interaction between these two variables indicate that higher-income individuals exhibit a stronger increase in expected balance as credit limits rise, suggesting that additional credit availability is utilized more aggressively at higher income levels. As before, splines are not interpretable through here so we will show an effects plot at the end. Lastly, age indicates as age goes up 1%, we expect a .10% decrease in the expected balance. As always, we always want to diagnose all of our statistical models and this will be no different.
par(mfrow=c(2,2))
plot(credittweed)
idsx <- NULL
for(name in colnames(model.matrix(credittweed))[-1]){
idsx <- c(idsx, name)
}
par(mfrow=c(3,4))
for(i in idsx){
plot(model.matrix(credittweed)[,i], resid(credittweed, type = 'deviance'), xlab = i, ylab = 'Deviance Residuals')
abline(h=0, col = 'red')
}
par(mfrow=c(1,1))
print(paste("Deviance:", pchisq(deviance(credittweed), df.residual(credittweed), lower = FALSE)))
[1] "Deviance: 0"
idsx <- NULL
for( i in colnames(model.matrix(credittweed))[-1]){
idsx <- c(idsx, i)
}
par(mfrow = c(3,4))
for(i in idsx){
partial <- model.matrix(credittweed)[,i]*credittweed$coefficients[i]+residuals(credittweed, type = 'deviance')
plot(model.matrix(credittweed)[,i], partial, xlab = i, ylab = 'Partial Devaince Residuals')
abline(h = 0, col = 'red')
}
par(mfrow=c(1,1))
residualPlots(credittweed)
Test stat Pr(>|Test stat|)
Income 509.2707 < 2.2e-16 ***
log(Limit) 13.4767 0.0002415 ***
bs(Rating, df = 4)
Cards 10.9571 0.0009325 ***
Age 2.6649 0.1025860
Education
Gender
Student
Married
Ethnicity
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
print(sum(residuals(credittweed, type = 'pearson')^2)/(df.residual(credittweed)))
[1] 25.86674
vif(credittweed)
GVIF Df GVIF^(1/(2*Df))
Income 4883.331201 1 69.880836
log(Limit) 136.610146 1 11.688034
bs(Rating, df = 4) 1498.157928 4 2.494277
Cards 1.403529 1 1.184706
Age 1.107001 1 1.052141
Education 1.056146 2 1.013750
Gender 1.024729 1 1.012289
Student 1.120972 1 1.058760
Married 1.107885 1 1.052561
Ethnicity 1.096239 2 1.023237
Income:log(Limit) 5559.516689 1 74.562167
infl <- data.frame(
hat = hatvalues(credittweed),
cooks = cooks.distance(credittweed),
rstandard = rstandard(credittweed),
rstudent = rstudent(credittweed)
)
infl <- infl[order(infl$cooks, decreasing = TRUE),]
head(infl)
Our diagnostics check out several things; our deviance residuals vs fitted seems to show a claw despite most of the data being smooth. This claw around the lower fitted values can indicate structural zeros and if that is the case, we may have to move to a hurdle model which means we would need to create a logistic regression model to indicate whether an account will have a 0 or not and then use a gamma regression instead for our analysis since we will be utilizing a data set completely void of zeros. Other things to point is although our pearson residuals seem a bit flat, however our dispersion score is way above the acceptable value of 1 which could indicate some issues with our variance assumption as the variance must grow with the mean for gamma/tweedie. We also see some multi-collinearity between main affects of income, log of limit and their interaction value which is normal. We also see multi-collinearity with splines which is also normal. We also see we have several observations that are beyond the threshold for cook’s distance. Lastly, our deviance score was 0. That indicates that the model likelihood matches that of the saturated model. This is expected in a tweedie models when strong predictors perfectly explain the zero mass and the positive mean structure. As a result, deviance-based goodness-of-fit statistics are not informative in this setting.
We will move on to a hurdle model next.
Lastly, we can do an Elastic-Net regularization to see which terms are utilized the best for prediction. We will include every single term that we had in the credittrue model as our candidate model. Then we will utilize only the terms in the credittrue3 model to see if the simpler model predicts just as well as the more complex one. We will do an Elastic-Net pipeline that will allow us to select the best L1 ratio as well as the best alpha value for each model.
Credit$has_bal <-(Credit$Balance > 0)
creditbin <- glm(has_bal~Income+Limit+Cards+Age+Education+Gender+Married+Ethnicity, data = Credit, family = 'binomial')
print(summary(creditbin))
Call:
glm(formula = has_bal ~ Income + Limit + Cards + Age + Education +
Gender + Married + Ethnicity, family = "binomial", data = Credit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.758e+00 2.763e+00 -3.532 0.000413 ***
Income -1.125e-01 2.157e-02 -5.215 1.84e-07 ***
Limit 3.891e-03 5.358e-04 7.263 3.79e-13 ***
Cards 3.149e-01 2.250e-01 1.399 0.161688
Age 1.572e-03 1.570e-02 0.100 0.920244
Educationmid education 1.682e-01 2.011e+00 0.084 0.933308
Educationhigh education 1.377e+01 1.472e+03 0.009 0.992536
GenderFemale 7.651e-01 5.293e-01 1.446 0.148270
MarriedYes -7.184e-02 5.284e-01 -0.136 0.891845
EthnicityAsian 2.515e-01 6.919e-01 0.363 0.716287
EthnicityCaucasian 8.631e-01 6.351e-01 1.359 0.174164
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 426.53 on 399 degrees of freedom
Residual deviance: 103.74 on 389 degrees of freedom
AIC: 125.74
Number of Fisher Scoring iterations: 15
Anova(creditbin, test.statistic = 'Wald')
Analysis of Deviance Table (Type II tests)
Response: has_bal
Df Chisq Pr(>Chisq)
Income 1 27.1933 1.841e-07 ***
Limit 1 52.7488 3.791e-13 ***
Cards 1 1.9584 0.1617
Age 1 0.0100 0.9202
Education 2 0.0071 0.9965
Gender 1 2.0900 0.1483
Married 1 0.0185 0.8918
Ethnicity 2 2.0361 0.3613
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
par(mfrow=c(2,2))
plot(creditbin)
binnedplot(predict(creditbin), residuals(creditbin))
binnedplot(creditbin$fitted.values, model.response(model.frame(creditbin)), xlab = 'Mean Predicted Probability', ylab = 'Observed Fraction', main = 'Calibration Plot')
abline(0, 1, lty = 2)
idsx <- NULL
for(i in colnames(model.matrix(creditbin))[-1]){
idsx <- c(idsx,i)
}
par(mfrow = c(3,4))
for(i in idsx){
binnedplot(model.matrix(creditbin)[,i], residuals(creditbin), xlab = i, ylab = 'Deviance Residuals (binned)')
abline(h = 0, col = 'red')
}
par(mfrow = c(1,1))
idsx <- NULL
for(i in colnames(model.matrix(creditbin))[-1]){
idsx <- c(idsx, i)
}
par(mfrow=c(3,4))
for(i in idsx){
partial <- model.matrix(creditbin)[,i]*creditbin$coefficients[i] + residuals(creditbin, type = 'working')
plot(model.matrix(creditbin)[,i], partial, xlab = i, ylab = 'Partial Residuals')
abline(h = 0, col = 'red')
}
par(mfrow=c(1,1))
residualPlots(creditbin)
Test stat Pr(>|Test stat|)
Income 0.5309 0.46623
Limit 10.7020 0.00107 **
Cards 0.5305 0.46639
Age 0.0854 0.77015
Education
Gender
Married
Ethnicity
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
print(sum(residuals(creditbin, type = 'pearson')^2)/df.residual(creditbin))
[1] 1.701076
vif(creditbin)
GVIF Df GVIF^(1/(2*Df))
Income 2.586828 1 1.608362
Limit 2.672285 1 1.634713
Cards 1.068791 1 1.033823
Age 1.094640 1 1.046250
Education 1.029616 2 1.007323
Gender 1.075508 1 1.037067
Married 1.085895 1 1.042063
Ethnicity 1.181848 2 1.042654
infl <- data.frame(
hat = hatvalues(creditbin),
cooks = cooks.distance(creditbin),
rstandard = rstandard(creditbin),
rstudent = rstudent(creditbin)
)
infl <- infl[order(infl$cooks, decreasing = TRUE), ]
head(infl)
The diagnostic deviance plot shows that we have our fitted values centered around 0, however we do seem to be off in some areas, however we do not see a trend or any obvious curvature. Our QQ plot, although not as appropriate as in the OLS setting, shows that we have no outliers that may be 4 standard deviations away. Our calibration plot also seems to be well aligned. We also see very little outliers in general although there are two that has appeared in the tweedie model as well, observations 205 and 383. Our deviance vs covariate plots show exactly what was confirmed by the summary plot in regards to important variables as the binned deviance plot for the covariates show limit and income being strongly centered around 0, age looking more flat and less concentrated. We do not see an obvious curve or trend in any of the plots. Our pearson residual plots also show a flat cloud and combined with a dispersion parameter of about 1.4, there isn’t too much to worry about. We also do not have any multi-collinearity.
y_true <- Credit$has_bal
y_score <- predict(creditbin, type = 'response')
pr <- pr.curve(scores.class0 = y_score[y_true == TRUE],
scores.class1 = y_score[y_true == FALSE],
curve = TRUE)
plot(pr)
pr$auc.integral
[1] 0.9964298
library(caret)
cands <- quantile(y_score, seq(0.05, 0.95, by = 0.05))
results <- data.frame()
for(t in cands){
pred <- factor(ifelse(y_score > t, 1, 0), levels = c(0,1))
truth <- factor(y_true, levels = c(0,1))
cm <- confusionMatrix(pred, truth, positive = '1')
results <- rbind(
results,
data.frame(
threshold = t,
precision = cm$byClass["Precision"],
recall = cm$byClass["Recall"],
F1 = cm$byClass["F1"]
)
)
}
results
The Precision-Recall curve shows near-perfect separation between classes, with an average precision of .997. Precision remains above 99% across a wide range of recall values, indicating that the model maintains a very low false-positive rate even when aggressively identifying positive cases. This is especially meaningful given class imbalance, where PR-AUC is more informative than ROC-AUC.
Now, we will create a gamma model, which is a tweedie model with powers = 2. This will only use positive balance information.
cred <- Credit[Credit$Balance > 0,]
plot(cred)
summary(cred)
ID Income Limit Rating Cards Age
Min. : 1.00 Min. : 10.35 Min. : 1160 Min. :126.0 Min. :1.000 Min. :23.00
1st Qu.: 98.25 1st Qu.: 23.15 1st Qu.: 3976 1st Qu.:304.0 1st Qu.:2.000 1st Qu.:42.00
Median :209.50 Median : 37.14 Median : 5147 Median :380.0 Median :3.000 Median :55.50
Mean :202.44 Mean : 49.98 Mean : 5485 Mean :405.1 Mean :2.997 Mean :55.61
3rd Qu.:306.50 3rd Qu.: 63.74 3rd Qu.: 6453 3rd Qu.:469.0 3rd Qu.:4.000 3rd Qu.:69.00
Max. :400.00 Max. :186.63 Max. :13913 Max. :982.0 Max. :9.000 Max. :98.00
Education Gender Student Married Ethnicity Balance Balance_plus
low education : 12 Male :145 No :271 No :118 African American: 78 Min. : 5.0 Min. : 15.0
mid education :296 Female:165 Yes: 39 Yes:192 Asian : 74 1st Qu.: 338.0 1st Qu.: 348.0
high education: 2 Caucasian :158 Median : 637.5 Median : 647.5
Mean : 671.0 Mean : 681.0
3rd Qu.: 960.8 3rd Qu.: 970.8
Max. :1999.0 Max. :2009.0
Balance_log Balance_sqrt Education_t has_bal
Min. :2.708 Min. : 2.236 low education : 12 Mode:logical
1st Qu.:5.852 1st Qu.:18.385 mid education :296 TRUE:310
Median :6.473 Median :25.249 high education: 2
Mean :6.256 Mean :24.411
3rd Qu.:6.878 3rd Qu.:30.996
Max. :7.605 Max. :44.710
creditgam <- glm(Balance~Income*log(Limit)+bs(Rating, df = 4)+Cards+Age+Education+Gender+Student+Married+Ethnicity, data = cred, family = Gamma(link = 'log'))
print(summary(creditgam))
Call:
glm(formula = Balance ~ Income * log(Limit) + bs(Rating, df = 4) +
Cards + Age + Education + Gender + Student + Married + Ethnicity,
family = Gamma(link = "log"), data = cred)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.368e+01 3.073e+00 -4.452 1.21e-05 ***
Income -2.022e-01 2.225e-02 -9.087 < 2e-16 ***
log(Limit) 2.438e+00 4.241e-01 5.747 2.27e-08 ***
bs(Rating, df = 4)1 -3.046e-01 3.916e-01 -0.778 0.437240
bs(Rating, df = 4)2 4.867e-01 7.903e-01 0.616 0.538482
bs(Rating, df = 4)3 -1.057e+00 9.718e-01 -1.088 0.277491
bs(Rating, df = 4)4 -1.855e+00 1.114e+00 -1.665 0.096962 .
Cards 4.221e-02 1.180e-02 3.576 0.000407 ***
Age -2.541e-03 8.641e-04 -2.940 0.003541 **
Educationmid education -1.125e-02 7.539e-02 -0.149 0.881437
Educationhigh education 3.109e-02 1.948e-01 0.160 0.873291
GenderFemale -1.382e-02 2.894e-02 -0.478 0.633293
StudentYes 9.716e-01 4.750e-02 20.453 < 2e-16 ***
MarriedYes -1.491e-02 3.078e-02 -0.484 0.628527
EthnicityAsian 3.024e-02 4.175e-02 0.724 0.469500
EthnicityCaucasian 7.308e-02 3.546e-02 2.061 0.040215 *
Income:log(Limit) 2.116e-02 2.557e-03 8.275 4.59e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Gamma family taken to be 0.06341369)
Null deviance: 179.312 on 309 degrees of freedom
Residual deviance: 26.298 on 293 degrees of freedom
AIC: 4011.4
Number of Fisher Scoring iterations: 8
With the exact same variables as the tweedie model, we see that the same variables are significant, but we also see some new ones that now matter. Age seems to now matter. Going from Black to Asian, doesn’t matter, but going from Black to White does. Having more than 3 cards also seems to be significant. We also see that our deviance value has dramatically decreased where before it was 8,126 and now it is 26.11. This model is significantly better. Now for the rest of our diagnostics.
par(mfrow = c(2,2))
plot(creditgam)
idsx <- NULL
for(i in colnames(model.matrix(creditgam))[-1]){
idsx <- c(idsx, i)
}
par(mfrow = c(3,4))
for(i in idsx){
plot(model.matrix(creditgam)[,i], residuals(creditgam, type = 'deviance'), xlab = i, ylab = 'Deviance Residuals', main = 'Deviance Residuals')
}
par(mfrow = c(1,1))
print(pchisq(deviance(creditgam), df.residual(creditgam), lower = FALSE))
[1] 1
idsx <- NULL
for(i in colnames(model.matrix(creditgam))[-1]){
idsx <- c(idsx, i)
}
par(mfrow = c(3,4))
for(i in idsx){
partial <- model.matrix(creditgam)[,i] * creditgam$coefficients[i] + residuals(creditgam, type = 'deviance')
plot(model.matrix(creditgam)[,i], partial, xlab = i, ylab = 'Partial Residuals', main = 'Partial Residuals')
}
par(mfrow = c(1,1))
residualPlots(creditgam)
Test stat Pr(>|Test stat|)
Income 0.4286 0.5127
log(Limit) 0.6270 0.4285
bs(Rating, df = 4)
Cards 0.0675 0.7950
Age 0.0045 0.9464
Education
Gender
Student
Married
Ethnicity
print(sum(residuals(creditgam, type = 'pearson')^2)/df.residual(creditgam))
[1] 0.06341369
vif(creditgam)
GVIF Df GVIF^(1/(2*Df))
Income 3462.282735 1 58.841165
log(Limit) 120.196802 1 10.963430
bs(Rating, df = 4) 1278.347077 4 2.445294
Cards 1.381528 1 1.175384
Age 1.094224 1 1.046052
Education 1.066712 2 1.016276
Gender 1.018975 1 1.009443
Student 1.213236 1 1.101470
Married 1.091797 1 1.044891
Ethnicity 1.070604 2 1.017202
Income:log(Limit) 3938.880434 1 62.760501
infl <- data.frame(
hat = hatvalues(creditgam),
cooks = cooks.distance(creditgam),
rstandard = rstandard(creditgam),
rstudent = rstudent(creditgam)
)
infl <- infl[order(infl$cooks, decreasing = TRUE),]
head(infl)
Our diagnostics look even better than the tweedie. We do not see any trend or obvious curvature in the deviance residuals and even have a deviance score of 1, much better than the 0 we received for the tweedie model. We see that we have no curvature or obvious trends when it comes to the our deviance residuals as well so linearity checks out. Our pearson residual plots show a flat cloud indicating that the variance assumption is met and is supported by dispersion score that is below 1 meaning that we may have overfit our model. Regardless, we can go forward with the summary and interpretations.
print(summary(creditgam))
Call:
glm(formula = Balance ~ Income * log(Limit) + bs(Rating, df = 4) +
Cards + Age + Education + Gender + Student + Married + Ethnicity,
family = Gamma(link = "log"), data = cred)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.368e+01 3.073e+00 -4.452 1.21e-05 ***
Income -2.022e-01 2.225e-02 -9.087 < 2e-16 ***
log(Limit) 2.438e+00 4.241e-01 5.747 2.27e-08 ***
bs(Rating, df = 4)1 -3.046e-01 3.916e-01 -0.778 0.437240
bs(Rating, df = 4)2 4.867e-01 7.903e-01 0.616 0.538482
bs(Rating, df = 4)3 -1.057e+00 9.718e-01 -1.088 0.277491
bs(Rating, df = 4)4 -1.855e+00 1.114e+00 -1.665 0.096962 .
Cards 4.221e-02 1.180e-02 3.576 0.000407 ***
Age -2.541e-03 8.641e-04 -2.940 0.003541 **
Educationmid education -1.125e-02 7.539e-02 -0.149 0.881437
Educationhigh education 3.109e-02 1.948e-01 0.160 0.873291
GenderFemale -1.382e-02 2.894e-02 -0.478 0.633293
StudentYes 9.716e-01 4.750e-02 20.453 < 2e-16 ***
MarriedYes -1.491e-02 3.078e-02 -0.484 0.628527
EthnicityAsian 3.024e-02 4.175e-02 0.724 0.469500
EthnicityCaucasian 7.308e-02 3.546e-02 2.061 0.040215 *
Income:log(Limit) 2.116e-02 2.557e-03 8.275 4.59e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Gamma family taken to be 0.06341369)
Null deviance: 179.312 on 309 degrees of freedom
Residual deviance: 26.298 on 293 degrees of freedom
AIC: 4011.4
Number of Fisher Scoring iterations: 8
Anova(creditgam, test.statistic = "Wald")
Analysis of Deviance Table (Type II tests)
Response: Balance
Df Chisq Pr(>Chisq)
Income 1 565.0851 < 2.2e-16 ***
log(Limit) 1 38.5311 5.389e-10 ***
bs(Rating, df = 4) 4 26.3330 2.711e-05 ***
Cards 1 12.7913 0.0003482 ***
Age 1 8.6448 0.0032800 **
Education 2 0.0749 0.9632255
Gender 1 0.2281 0.6329374
Student 1 418.3218 < 2.2e-16 ***
Married 1 0.2346 0.6281650
Ethnicity 2 4.5687 0.1018398
Income:log(Limit) 1 68.4693 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
plot(effect('Rating', creditgam))
With the exact same variables as the tweedie model, we see that the same variables are significant, but we also see some new ones that now matter. Age seems to now matter with each 1% increase in age coinciding with a .26% decrease in expected Balance holding all else constant.. Having more than 3 cards also seems to be significant as each level past two cards seems to rise. If you are a student, expect your balance to rise about 97% and for each percentage increase in income, your expected balance will drop by 20%. Every 1% increase in the log of limit coincides with a 2.4% increase in the expected value of Balance. Once again, with the interaction term, it tells us that the higher the income you are, your expected increase in balance for each percentage increase in limit increases.With rating, we are using a cubic basis-spline, visually we can see how that variable behaves. At low values of rating, we have a high variance, but as rating’s score increases, we see the variance tighten up, especially at ratings of 400 the balance is around 500. We also see that our deviance value has dramatically decreased where before it was 8,126 and now it is 26.11. This model is significantly better.
We can conclude that our Gamma regression and our hurdle model overall is a significantly better model than the tweedie model due to the structural zeros in the data. However, is it a better model than the linear regressions that we created? Inference wise, there is no doubt that it is better, but the best way to truly confirm which is a better model for prediction purposes is through cross validation.
The first part of the cross validation, we will compare the full credittrue model to the reduced credittrue3 model, to further see if the larger model is better since we know due to the ANOVA that the larger model was a better model when it came to inferences. Now, we will compare their prediction ability.
set.seed(0)
split_obj <- initial_split(Credit, prop = 0.80)
train_df <- training(split_obj)
test_df <- testing(split_obj)
enet_pipe <- linear_reg(penalty = tune(), mixture = tune()) %>% set_engine('glmnet')
wf_full <- workflow() %>% add_model(enet_pipe) %>% add_formula(Balance_log~bs(log(Income), df = 8)+bs(log(Limit), df = 8)+bs(Rating, df = 4)+bs(log(Income), df = 8)*Student+bs(log(Limit), df = 8)*Student+Cards+Age+Education+Gender+Student+Married+Ethnicity)
wf_reduced <- workflow() %>% add_model(enet_pipe) %>% add_formula(Balance_log~bs(log(Income), df = 8)+bs(log(Limit), df = 8)+bs(log(Income), df = 8)*Student+bs(log(Limit), df = 8)*Student+Student)
folds <- vfold_cv(train_df, v = 10)
mset <- metric_set(rsq, mae, yardstick::rmse)
grid <- grid_latin_hypercube(
penalty(range = c(-6, 1)),
mixture(range = c(0, 1)),
size = 50
)
ctrl <- control_grid(save_pred = TRUE)
tune_full <- tune_grid(
wf_full, resamples = folds, grid = grid,
metrics = mset, control = ctrl
)
tune_reduced <- tune_grid(
wf_reduced, resamples = folds, grid = grid,
metrics = mset, control = ctrl
)
best_full <- select_best(tune_full, metric = 'rsq')
best_reduced <- select_best(tune_reduced, metric = 'rsq')
print("best alpha/lambda full:")
[1] "best alpha/lambda full:"
print(best_full)
print("best alpha/lambda reduced:")
[1] "best alpha/lambda reduced:"
print(best_reduced)
print("R2, MAE, RMSE")
[1] "R2, MAE, RMSE"
print(collect_metrics(tune_full)[collect_metrics(tune_full)$penalty == best_full$penalty, ])
print("R2, MAE, RMSE")
[1] "R2, MAE, RMSE"
print(collect_metrics(tune_reduced)[collect_metrics(tune_reduced)$penalty == best_reduced$penalty, ])
final_full_fit <-
finalize_workflow(wf_full, best_full) %>%
fit(train_df)
final_reduced_fit <-
finalize_workflow(wf_reduced, best_reduced) %>%
fit(train_df)
pred_full <- predict(final_full_fit, test_df) %>% bind_cols(test_df %>% select(Balance_log))
pred_red <- predict(final_reduced_fit, test_df) %>% bind_cols(test_df %>% select(Balance_log))
print("Hold Out Metrics")
[1] "Hold Out Metrics"
mset(pred_full, truth = Balance_log, estimate = .pred)
mset(pred_red, truth = Balance_log, estimate = .pred)
NA
tol <- 1e-6
coef_full <-
final_full_fit %>%
extract_fit_parsnip() %>%
tidy() %>%
filter(term != "(Intercept)")
chosen_full <- coef_full %>% filter(abs(estimate) > tol)
neglected_full <- coef_full %>% filter(abs(estimate) <= tol)
print("Chosen (Full Model):")
[1] "Chosen (Full Model):"
print(chosen_full)
print("Neglected (Full Model:")
[1] "Neglected (Full Model:"
print(neglected_full)
coef_reduced <-
final_reduced_fit %>%
extract_fit_parsnip %>%
tidy() %>%
filter(term != "(Intercept)")
chosen_reduced <- coef_reduced %>% filter(abs(estimate) > tol)
neglected_reduced <- coef_reduced %>% filter(abs(estimate) <= tol)
print("Chosen (Reduced Model):")
[1] "Chosen (Reduced Model):"
print(chosen_reduced)
print("Neglected (Reduced Model):")
[1] "Neglected (Reduced Model):"
print(neglected_reduced)
We can see that the more complex model, the credittrue one, barely outperformed the more simpler model. In several instances, the simpler model actually performed better when it came to k-fold cross validation. However, both the ANOVA and the hold out metrics tell us to go with the more complex linear regression model. We also see that the best regularization parameter says to use a full lasso model. Now, to compare the linear regression model with our GAMMA-Hurdle model.
form_lin = Balance_log~bs(log(Income), df = 8)+bs(log(Limit), df = 8)+bs(Rating, df = 4)+bs(log(Income), df = 8)*Student+bs(log(Limit), df = 8)*Student+Cards+Age+Education+Gender+Student+Married+Ethnicity
form_bin = has_bal~Income+Limit+Cards+Age+Education+Gender+Married+Ethnicity
form_gam = Balance~Income*log(Limit)+bs(Rating, df = 4)+Cards+Age+Student+Married+Ethnicity
set.seed(100)
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2, na.rm = TRUE))
gamma_deviance <- function(y, mu){
eps <- 1e-12
y <- pmax(y, eps)
mu <- pmax(mu, eps)
mean(2 * (-log(y / mu) + (y - mu) / mu))
}
folds <- vfold_cv(Credit, v = 10)
factor_vars <- c("Student", "Gender", "Married", "Ethnicity")
lvl_list <- lapply(Credit[factor_vars], function(x) levels(factor(x)))
out <- purrr::map_dfr(seq_along(folds$splits), function(i) {
sp <- folds$splits[[i]]
tr <- rsample::analysis(sp)
te <- rsample::assessment(sp)
for (v in factor_vars) {
tr[[v]] <- factor(tr[[v]], levels = lvl_list[[v]])
te[[v]] <- factor(te[[v]], levels = lvl_list[[v]])
}
x_tr <- model.matrix(form_lin, tr)[, -1, drop = FALSE]
y_tr <- tr$Balance_log
x_te <- model.matrix(form_lin, te)[, -1, drop = FALSE]
alphas <- seq(0, 1, by = 0.01)
cv_fits <- lapply(alphas, function(a) {
cv.glmnet(
x = x_tr, y = y_tr,
alpha = a,
nfolds = 10,
standardize = TRUE
)
})
best_idx <- which.min(sapply(cv_fits, function(f) min(f$cvm)))
best_cv <- cv_fits[[best_idx]]
yhat_log <- as.numeric(predict(best_cv, newx = x_te, s = 'lambda.min'))
yhat_lin <- exp(yhat_log)
rmse_lin <- rmse(te$Balance, yhat_lin)
fit_bin <- glm(form_bin, data = tr, family = binomial())
p_pos <- predict(fit_bin, newdata = te, type = 'response')
tr_pos <- tr %>% filter(Balance > 0)
fit_gam <- glm(form_gam, data = tr_pos, family = Gamma(link = 'log'))
mu_pos <- predict(fit_gam, newdata = te, type = 'response')
yhat_hurdle <- p_pos * mu_pos
rmse_hurdle <- rmse(te$Balance, yhat_hurdle)
te_pos <- te %>% filter(Balance > 0)
mu_pos_only <- predict(fit_gam, newdata = te_pos, type = 'response')
gdev <- gamma_deviance(te_pos$Balance, mu_pos_only)
tibble(
fold = i,
rmse_linear = rmse_lin,
rmse_hurdle = rmse_hurdle,
gamma_dev_pos = gdev,
best_alpha = alpha(best_idx),
best_lambda = best_cv$lambda.min
)
})
out_summary <- out %>%
summarise(
rmse_linear_mean = mean(rmse_linear),
rmse_linear_sd = sd(rmse_linear),
rmse_hurdle_mean = mean(rmse_hurdle),
rmse_hurdle_sd = sd(rmse_hurdle),
gamma_dev_pos_mean = mean(gamma_dev_pos),
gamma_dev_pos_sd = sd(gamma_dev_pos)
)
out
out_summary